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Abstract 

This article addresses the issue of representing electroencephalographic (EEG) signals in an efficient way. While classical ap- 
proaches use a fixed Gabor dictionary to analyze EEG signals, this article proposes a data-driven method to obtain an adapted 
CO dictionary. To reach an efficient dictionary learning, appropriate spatial and temporal modeling is required. Inter-channels links 
are taken into account in the spatial multivariate model, and shift-invariance is used for the temporal model. Multivariate learned 
kernels are informative (a few atoms code plentiful energy) and interpretable (the atoms can have a physiological meaning). Using 
real EEG data, the proposed method is shown to outperform the classical multichannel matching pursuit used with a Gabor dictio- 
nary, as measured by the representative power of the learned dictionary and its spatial flexibility. Moreover, dictionary learning can 
capture interpretable patterns: this ability is illustrated on real data, learning a P300 evoked potential. 
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Scalp electroencephalography (EEG) measures electrical ac- 
tivity produced by post-synaptic potentials of large neuronal as- 
semblies. Although this old medical imaging technique suffers 
from poor spatial resolution, EEG is still widely used in medi- 
cal contexts (e.g. sleep analysis, anesthesia and coma monitor- 
ing, encephalopathies) as well as entertainment and rehabilita- 
tion contexts (Brain-Computer Interfaces - BCI). EEG devices 
are relatively cheap compared to other imaging techniques (e.g. 
MEG, fMRI, PET), and they offer both high temporal resolu- 
tion (a short period of time between two acquisitions) and very 
low latency (a delay between the mental task and the recording 
on the electrodes). 

■ These featu res are of particular conce rn for the practitioner 
interested in (ISanei and Chambers! 120071) : 



Event-related potentials (ERPs) or evoked potentials: tran- 
sient electrical activity that results from external sensory 
stimulation (e.g. P300); 

Steady-state evoked potentials: oscillatory brain activity 
that results from repetitive external sensory stimulation; 
Event-related synchronizations/desynchronizations 
(ERS/ERD): oscillatory activity that results from involve- 
ment of a specialized part of the brain; for example, 
activation of the primary motor area, known as p. (8- 
13 Hz) or [i (13-30 Hz) bandpower synchronization or 
desynchronization, have been widely studied; 
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• Epileptic activity: transient electrical bursts of parts of the 
brain. 

While EEG devices are known to be able to record such afore- 
mentioned activities through the wide areas of the sensors, 
methodologists are usually necessary in EEG experiments. In- 
deed, they have to provide the practitioners with tools that can 
capture the temporal, frequential and spatial content of an EEG. 
Consequently, signal interpretation usually yields a represen- 
tation problem: which dictionary is able to best represent the 
information recorded in the EEG? 

Fourier and wavelets dictionaries allow spectral anal ysis of 
the signals through well-defined mathematical bases jDurkai 
20071 iMaliatL 120091) . although they show a lack of flexibility 
to represent the shape diversity of EEG patterns. The Gabor 
dictionary has also attracted high interest due to its temporal 
shift-invariance property. Nevertheless, it also suffers from 
a lack of flexibility to represent evoked poten tials and EEG 
bursts dNiedermever and Lopes da Silval 12004-1) . For example 
widely studied sleep activities such as spindles (centroparietal 
or frontal areas) consist of a complex EEG shape; as same 
epileptic activities such as inter-epileptic peaks are another ex- 
amples of repeatable and complexely sh aped cerebral activities 
(Nie dermever and Lopes da Silval 120041) . In these two cases 
practitioners should probably benefit from a custom-based dic- 
tionary approach over Fourier or wavelets dictionaries. While 
these approaches are based on a priori models of the data, re- 
cent methodological developments focus o n data-driven repre 



senta tions: dictionary learning algorithms (ITosic and Frossard 
201 lh . 



In EEG analysis, the spatial modeling consists of taking 
into account inter-channels links, and this has been done in 
sever al studies searching for more spatial flexibility (IDurka , 
2007). The EEG temporal modeling is more difficult. Some 
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approaches use a hypothesis of temporal stationarity and treat 
only the spatial aspect, but this brings about loss of informa- 
tion. Other approaches use the generic Gabor dictionary which 
is shift-invariant. But it remains difficult to learn an EEG 



dictionary that inte grates these two aspects (IJost et al.l 2005 



Hamner et al., 201 1). 



In this article, time-frequency analysis tools that are used to 
deal with EEG are first reviewed in Section |2] Then, tempo- 
ral modeling is proposed based on the shift-invariance, and a 
spatial model called multivariate to provide an efficient dictio- 
nary learning in Section[3] As validation, the multivariate meth- 
ods are applied to real EEG data, and then compared to other 
methods in Section [4] Finally, to show the interpretability of 
the learned kernels, methods are applied for learning the P300 
evoked potential. 

2. EEG analysis 

In this section, some of the classical signal processing tools 
that are applied to EEG data for time-frequency analysis are 
reviewed: monochannnel and multichannel sparse approxima- 
tions, shift-invariant dictionaries, and dictionary learning algo- 
rithms. 

2.1. Monochannel sparse approximation 

In this paragraph, the EEG analysis is provided indepen- 
dently for the different electrodes / channels, so it is called 
monochannel. A single channel signal y e of N tem- 
poral samples and a normalized dictionary <t> € R A ' xM com- 
posed of M time-frequency atoms \<p, 



ml m= \ 



are considered. The 
monochannel decomposition of the signal y is carried out on the 
dictionary <I> such that: 



y = <bx + e , 



(1) 



assuming x 



M for the coding coefficients, and e e M. N for 
the residual error. The approximation y is Ox. The dictionary 
is redundant since M > N, and thus the linear system of Eq. ([D 
is under-determined. Consequently, sparsity, smoothness or an- 
other constraint is needed to regularize the solution x. Consid- 
ering the constant K <sz M, the sparse approximation is written 
as: 



minJIy-OxH 7 s.t. \\x\\ <K , (2) 

with |.| for the Frobenius norm, and ||x|| for the (q pseudo- 
norm, defined as the number of nonzero elements of vector 
x. Th e well-known Matching Pursuit (MP) (Mallat a nd Zhan^. 
19931) tackles this difficult problem iteratively, but in a subopti- 
mal way. At iteration k, it iteratively selects the atom that is the 
most correlated to the residue e k ~ l 



m = arg max,, 



afl: 



(3) 



To carry out time-frequency analysis for EEG, MP is applied 



composed of several 



2.2. Multichannel sparse approximations 

Hereafter, the EEG signal y e IE 
channels c = 1..C are considered. The c ,h channel of the sig- 
nal y is denoted by y[c]. The following reviewed methods link 
these channels spatially with the multichannel model illustrated 
inFig.Q2left). 

A multichannel MP dGribonvalll2003h was set up using a spa- 
tial (or topographic) prior based on structured sparsity. This 
model is an extension of Eq. (Q), with y e R NxC , $ e R NxM and 
x e M MxC . The multichannel model linearly mixes an atom in 
the channels, each channel being characterized by a coefficient. 
The underlying assumption is that few EEG events are spatially 
spread over in all o f the different channels. 

In ([Durka, 2007), different multichannel selections are enu- 
merated: 



the original multichannel MP ( GribonvaH 2003), called 
MMP.l by Durka, selects the maximal energy such that: 



m k = arg max,,, ^ | 1 [c], 0,„) 



(4) 



the multichannel MP dDurka et all 120051) called MMP_2 
selects the maximal multichannel scalar product (also 
called average correlation): 



m = arg max,, 



(5) 



Due to absolute values, selection (0]l allows the atom (f>„, 
to be in-phase or in opposite phase with the component 
e k ~ 1 [c], contrary to the more constraining selection (|5) 
which prefers <p„, to be in-phase with e k ~ l [c] (thus giving 

the same polarities across channels). 

the multichannel MP dMatvsiak et all 120051) called 



MMP_3 selects the maximal energy as Eq. (0), but with 
complex coefficients that allows it to have varying phases: 
each channel c has its o wn phase tp c , as also studied in 

dGratkowski et alll2007h: 

• the multichannel MP ( Sieluzvcki et al. , 2009b), which will 
call MMP_4 selects the maximal multichannel scalar prod- 
uct as Eq. (|5), with complex coefficients that allows it to 
have a phase ip c for each channel. 

Note that other algorithms deal with EEG decomposition. 
For example, a mu ltichannel decomposition was proposed in 
(Koenia et al., 2001), but it was based on the method of frames 
dDaubechieslll988l) . 

2.3. Shift-invariant dictionaries 

The dictionary <5 used in the decomposition can have a par- 
ticular form. In the shift (also called translation or tempo ral)- 



( Durka and Blinowska|,|2001) with the Gabor parametric dictio- invariant model djost et all 12005: Bart helemv et all [2012). the 



nary, which is a generic dictionary that is widely used to study 
EEG signals. 



(A, B) = Tr(B H A) is the matrix scalar product. 



signal y is coded as a sum of a few structures, named kernels, 
that are characterized independently of their positions. The L 
shiftable kernels of the compact *P dictionary are replicated 
at all positions, to provide the M atoms of the <t> dictionary. 



2 



Kernels {t//i}f =i can have different lengths T/, so they are zero- 
padded. The N samples of the signal y, the residue e, the atoms 
<p m and the kernels if// are indexed by t. The subset 07 collects 
the translations r of the kernel iffi(t). For the few L kernels that 
generate all of the atoms, Eq. (Q} becomes: 



M 



y(f) = 2 x m 4> m {t) + e(t) 

m— 1 
L 

= J] Yj x, ' t ^ ~ t) + e(f) ■ 



(6) 
(7) 



1=\ rea- 1 



This model is also called convolutional, and as a result, the sig- 
nal y is approximated as a weighted sum of a few shiftable ker- 
nels if/i. It is thus adapted to overcome the latency variability 
(also called jitter or misalignment) of the events studied. 

Algorithms described in Section [272] are widely used with a 
Gabor dictionary for EEG (Durka, 2007). Its generic atoms 



^Gaboi- are parameterized as (Mallat, 2009): 



Gabor(0 = -7= ■ g[- — ) • COS ( 2nft + (f ) , 



(8) 



where g(t) = /3 ■ e~"'~ is a Gaussian window, (3 is a normalization 
factor, s is the scale, r is the shift parameter, a is the shift fac- 
tor, / is the frequency and tp is the phase used in MMP_3 and 
MMP_4 algorithms. Note that a multiscale Gabor dictionary is 
not properly shift-invariant since the shift facto r a, which de - 
pends on the dyadic scale s, is not equal to 1 (Mallat, 2009). 
The drawback of such a dictionary is that generic atoms intro- 
duce a priori for data analysis. 

2.4. Dictionary learning 

Recently, dictionary learning algorithms (DLAs) have 
allowed the learning of di ctionary atoms in a data- 



driven and unsupe r vised way (ILewicki and Seinowskil [1998 



Tosi c and Frossardl |201 lb . A set of iterations between sparse 
approximation and dictionary update provides learned atoms, 
which are no more generic but are adapted to the stud- 
ied data. Thus, learned dictionaries over come generic ones , 
show i ng better qualities for p rocessing ( Tosic and Frossardl 



201 It iBarthelemv et aU 120121) . Different algorithms deal 



with this problem: the method of optimal directions (MOD) 
(Eng an et all 12000) generali zed under the name iterative least- 
squares DLA (ILS- DLA) fengan et all l2007h. th e K-SVD 
dAharon et aill2006b. the online D LA dMairal et aill2010l) and 
others dTosic and Frossardl 1201 II) . 

Only two studies ha ve proposed to i nclude dictionary learn- 



ing for EEG data. In (iJost et all 120051) . the MoTIF algorithm 



which is a shift-invariant DLA, is applied to EEG. It thus leams 
a kernels dictionary, but only in a mon ochannel case, which 
does not consider the spatial aspe ct. In dHamner et all 1201 lb . 
the well-known K-SVD algorithm dAharon et alll2006b is used 
to carry out spatial or temporal EEG dictionary learning. The 
spatial learning is efficient and can be viewed as a general- 
izatio n of the N-Microstates algorithm dPascual-Marqui et all 



1995). Conversely, results of the temporal learning are not con- 
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Figure 1 : Illustration of the multichannel (left) and the multivariate (right) mod- 
els. 



The dictionary redundancy gives a more efficient rep- 
resentation than learning methods based on Principal 
Com ponent Analysis or I ndepe ndent Component Analy- 
sis dLewicki and Seinowskil 1998 ). Moreover, such methods 
provide only a base (with M = N), and are not adapted to cope 
with the shift-invariance required by the temporal variability of 
the EEG. 



2.5. Summary of the state of the art 

To sum up, the previous paragraph on dictionary learning has 
already shown the relevance of the shift-invariance to learn the 
EEG temporal atom. That is also why the parametric Gabor dic- 
tionary, which is quasi shift-invariant, is well-adapted f or such 
data and is widely used with the multichannel MPs (Durka, 
2007). In multichannel decompositions, the different MPs try to 
be flexible to match the spatial variability. The use of complex 
Gabor atoms adds a degree of freedom t hat improves the qualit y 



of the representation (or reconstruction) (Matvsi ak et al.l l2005). 



The proposed multivariate approach takes into account these 
two aspects in a dictionary learning approach: a relevant shift- 
invariant temporal model and a spatial flexibility which consid- 
ers all of the channels. 

3. The multivariate approach 

I n this section, the genera l multivariate approach introduced 



(Barthele mv et al. . 2012 ) is adapted to the context of the 



EEG. The underlying model is first detailed, and then the meth- 
ods are explained: multivariate orthogonal MP (M-OMP) and 
multivariate dictionary learning algorithm (M -DLA). 



Note that the name multivaria te is used i n ( Sieluz vcki et al _ 
2009a|) to design ate MMP_2 dDurkaetall 12005b applied to 



MEG data, and in dSieluzvcki et all l2009bb for its complex ex- 
tension M MP_4; they are totally di fferent from the methods ex- 
plained in (Bar thelemv et al.ll2012b . 



vincing, mainly because the shift-invariant model is not used. 



3.1. Multivariate model 

In the multivariate model, Eq. (Q} is kept, but with y e R A ' xC , 
<I> e Ji NxMxC and x e M M , and now considering the multiplica- 
tion <&x as an element-wise product along the dimension M. In 
the multichannel model, a same monochannel atom is linearly 
diffused in the different channels (imposing a rank-1 matrix). 
Whereas in the multivariate model illustrated in Fig. fright), 
each component has its own atom, forming a flexible multi- 
component atom, multiplied by one coefficient. The differences 
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between multichannel and multivariate models are detailed in 



(Barthelemvet al., 2012) 



3.2. Multivariate methods 

A brief description of the multivariate methods is given in 
this paragraph, as all of the computational details can be found 



in dBarthelemv et al., 2012). Note that these methods are de- 



scribed in a shift-invariant way in (Barthele mv et al l l2012h .but 
for more simplicity, a non-shift-invariant formalism is used in 
this se ction, with the at oms dictionary <t>. First, remark that the 
OMP dPati et all 1 1993b is an optimal version of the MP, as the 
provided coefficients vector x is the least-squares solution of 
Eq. (0, contrary to the MP. The multivariate OMP is the ex- 
tension of the OMP to the multivariate model described previ- 
ously. At the current iteration k, the algorithm selects the atom 
that produces the strongest decrease (in absolute value) in the 



mean square error (MSE) 



jfc-i 



Jc-l 



Denoting the current residue 



+ e , we have: 



jfc-i 



dx„, 



^ = 2 Tr( 0,/ e*- 1 ) = 2 (e^',0,,,) , 



(9) 



using the derivation rules of (Petersen and PedersenlEoOSl So, 
the selection step chooses the maximal multivariate scalar prod- 
uct: 



m = arg max., 



= arg max,, 



(e* \<Pm) I , 

c 



(10) 
(11) 



Consequently, selection (TlOb is the multivariate extension of se- 
lection J3J, with no more considerations about channels polari- 
ties as for selections (0]i and (0. 

Concerning the M-DLA, a training set of multivariate sig- 
nals {y p J is considered (the index p is added to the other 

variables). In M-DLA, each trial y p is treated one at a time. 
This is an online alternation between two steps: a multivariate 
sparse approximation and a multivariate dictionary update. The 
multivariate sparse approximation is carried out by M-OMP: 



x p = arg min v || y p - <$>x " s.t. ||*||„ < K , 



(12) 



and the multivariate dictionary update is based on maximum 
likelihood criterion (IQlshausen and Fieldl 1 19971) . on the as- 
sumption of Gaussian noise: 

$ = argmin^Hyp - <bx p || 2 s.t. V meN M , 110*11 = 1- (13) 

This dictionary update step is solved by a stochastic gradient 
descent. At the end of the M-DLA, the learned dictionary is 
adapted to the training set. In the following of this article, 
the presented multivariate methods are used in a shift-invariant 
way. 

Besides applying the M-DLA on high-noised data, the evo- 
lution of the kernels length has been improved (Experiment 1 
and 2) and specific EEG activities have been time-localized to 
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Figure 2: EEG signal y^i sampled at 250 Hz with C = 22 channels. 



favour the learning (Experiment 2). Moreover, only two hy- 
potheses are followed in this approach: EEG noise is a Gaus- 
sian additive noise, and EEG events can be considered as sta- 
tistically repeated following the same stimulus. There are no 
spatial or temporal assumptions made on the dictionary: the 
learning results are data-driven at most. 



4. Experiment 1: dictionary learning and decompositions 

Two databases with different numbers of electrodes are used 
to test the genericity of the proposed method. The following 
two experiments aim at showing that multivariate learned ker- 
nels are informative (Experiment 1) and interpretable (Exper- 
iment 2) in a low signal-to-noise (SNR) ratio context. In this 
first experiment, the algorithms presented are applied to EEG 
data. They are compared to other decomposition methods to 
highlight their model novelty and their representative perfor- 
mances. 

4.1. EEG data 

Real data are us ed in the following expe riments and compar- 
isons. Dataset 2a (Tan germann et al. , 2012 ) from BCI Compe- 
tition IV is considered. There are four classes of motor tasks, 
but they are not taken into account in this paper. EEG signals 
are sampled at 250 Hz using C — 22 channels. Compliance to 
our model is natural, as signals are organized into trials. A trial 
consists of N = 501 temporal samples, during which subject 
is asked to perform one among four specific motor tasks. Data 
come from 9 subjects, and the trials are divided in a training set 
and a testing set. Each set is composed of P = 288 signals. 
Raw data are filtered between 8 and 30 Hz (motor imagery con- 
cerns jj and (3 bands) and zero-padded. Data resulting from this 
preprocessing are the inputs of the M-DLA. The first samples 
of the EEG signal y p= \ are plotted in Fig. [2] 
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4.2. Models and comparisons 

M-DLA is applied to the training set of the first subject, and 
a dictionary of L — 20 kernels is learned with 100 iterations 0. 
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Figure 3: Real Gabor atom used with MMP.l orMMP_2: the temporal profiles 
of each channel (top) and the time-frequency visualization (bottom). 
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C — 22 channels. In Fig. [3] and [4] at the top, amplitudes 
x m x <Pm (ordinate) of one atom are represented as a function 
of samples (abscissa), and on the bottom, spectrograms in re- 
duced frequencies (ordinate) are represented as a function of 
samples ( abscissa). A G abor atom is plotted in Fig. [3] b ased o n 



MMP.l (Gribonval, 2003) or on MMP.2 (Durkaet al., 2005) 



which gives the same kind of atoms. Gabor atom parameters are 
randomly chosen. In Fig. [4] a Gabor atom is plotted ba sed on 



2007b 



MMP_3 (Matvsiaket al., 200 51: iGratkowski et al. 
MMP_4 dSieluzvcki et all l2009bh . Since this atom has a spe- 



cific phase for each channel, it is more adaptive than the first 
one. Nevertheless, in these two cases, the spectral content is 
identical in each channel. 

In Fig. [5] two learned multivariate kernels are plotted, I = 9 
(top) and / = 17 (bottom). Components of Fig. [3J top) are sim- 
ilar, that is adapted to fit data structured like signal yi around 
samples t - 175 of Fig. [2] Components of Fig. |5lbottom) are 
not so different: they appear to be continuously and smoothly 
distorted, which corresponds exactly to data structured like sig- 
nal yi around samples t = 75 of Fig. [2] Moreover, they 
have various spectral contents, as seen in Fig. [6] contrary 
to Gabor atoms, which look like monochannel filter banks 
(Fig-EJbottom) and|4fbottom)). In the multivariate model, each 
channel has its own profile, and so its own spectral content, 
which gives an excellent spatial adaptability. 
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Figure 4: Complex Gabor atom used with MMP_3 or MMP_4: the temporal 
profiles of each channel (top) and the time-frequency visualization (bottom). 
Only the phase of the different channels varies, not the spectral content. 

To show the novelty of the proposed multivariate model, 
the existing multicomponent dictionaries are compared, with 



2 During the DLA, the control of the kernels length is tricky due to the low 
signal-to-noise ratio of the data. For the original M-DLA, the kernels were 
first initialized on an arbitrary length T' , and they were then lengthened or 
shortened during the update steps, depending on the energy presence in their 
edges. Nevertheless, with these rough data, kernels tend to lengthened without 
stopping. So, a new control method is set up: a limit length T m borders the 
kernels over the first 2/3 of the iterations, and then, the border is fixed to T'" +40 
for the last iterations. This allows kernels to begin to converge, and to then have 
the possibility to obtain quasi-null edges, which avoids discontinuities in the 
latter decompositions using this dictionary. 



Figure 5: Kernels of the learned multivariate dictionary: kernel 1 = 9 (top) and 
kernel 1=17 (bottom). 



4.3. Decompositions and comparisons 

In this paragraph, the reconstructive power of the different 
dictionaries and multichannel sparse algorithms is evaluated. 

In a first round, the training set of the first subject is con- 
sidered. Learned dictionaries (LD) used with M-OMP, and 
a Gabor dictionary used with MMP.l, MMP_2, MMP_3 and 
MMP.4, are compared. The Gabor dictionary has M = 30720 
atoms. Two learned dictionaries are used: one with L — 20 ker- 
nels (learned in Section l4~2b . which gives M as 10000 atoms; 
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Figure 6: Spectrograms of three components of the kernel / = 17: component 
c= 10 (top), component c = 14 (middle), component c = 20 (bottom). 



and one with L — 60 which gives M as 30000 atoms, which is 
a size similar to the Gabor dictionary. For each case, ^-sparse 
approximations are computed on the training set, and the recon- 
struction rate p is then computed. This is defined as: 



(14) 



The rate p is represented as a function of K in Fig. UJ First, 
MMP.l (blue dash-dot line) is better than MMP_2 (blue dot- 
ted line), and MMP_3 (green solid line) is better than MMP_4 
(green dashed line), because the selection of Eq. (f5]l is more 
constraining than select ion of Eq. <j4j as well explained in 



(Barthele mv et al.L l2012). Then, MMP_3 andMMP_4 are better 



than the other MMP.l and MMP_2 due to their spatial flexibil- 
ity on phases atoms. Finally, learned dictionaries (black solid 
lines) are better than other approaches, even with three times 
fewer atoms. These two representations (LD with L = 20 and 
L = 60) are more compact since they are adapted to the stud- 
ied signals. If learned kernels code more energy than Gabor 
atoms, the learned dictionary takes more memory (for storage 
or transmission) than the parametric Gabor diction ary. Identi- 
cal results are observed in (IBarthelemv et al.i 1201 21) . but only in 
a monochannel case: the learned dictionary overcomes generic 
ones for sparse reconstructive power. 

The generalization is tested in a second round, determining if 
the adapted representation that was learned on the training set of 
the subject 1, remains efficient for other acquisitions and other 



subjects. Thus, the testing sets of the 9 subjects are now consid- 
ered. In the same way, /f-sparse approximations are computed 
with the LD (L = 60) on the different testing sets, and the rates 
p are plotted as a function of K in Fig. [8] The different curves 
are very similar and look like the curve of LD (black solid line 
with stars) in Fig. [7j that shows the intra-user and inter-user 
robustness of the learned representation. 

Since they have good representation properties, the learned 
dictionaries can be useful for EEG data simulation. More- 
over, as noted in dTosic and Frossard , 201 1 ), learned dictionar- 
ies overcome classical approaches for processing such as de- 
noising, etc. 
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Figure 7: Reconstruction rate p on the training set as a function of the sparsity 
K of the approximation of the different dictionaries and algorithms. 
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Figure 8: Reconstruction rate p of the M-OMP used with the LD (L = 60), as a 
function of the sparsity K of the approximation on the 9 testing sets. 
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5. Experiment 2: evoked potentials learning 

The previous experiment has shown the performances and 
the relevance of such adaptive representations. The question is 
to know if these learned kernels can capture behavioral struc- 
tures with physiological interpretations. To answer, we will fo- 
cus on the P300 evoked potential. 

5.7. P 300 data 

In this section, the experi ment is carried out on dataset 2b 
of the P300 speller paradigm dBlankertz et al.U2004 . from BCI 
Competition II. To sum up, a subject is exposed to visual stim- 
uli. When it is a target stimulus, a P300 evoked potential is 
provoked in the brain of the subject 300 ms after, contrary to a 
non-target stimulus. The difficulty is the low SNR of the P300 
evoked potentials. 

In this dataset, P = 1261 target stimuli are carried out. Con- 
sidering the complete acquisition Y e M. NxC of N samples, it is 

parsed in P epoched signals Jy p J of TV samples. D e M. NxN 
is a Toeplitz matrix, the first column of which is defined such 
that D(t p , 1) = 1, where t p is the onset of the pth target stimu- 
lus. Acquired signals have C = 64 channels and are sampled at 
240 Hz. They are filtered between 1 and 20 Hz by a 3rd order 
Butterworth filter. 

5.2. Review of models and methods 

There are two models for the P300 waveform. With 0p3oo e 
R A ' xC , the classical additive model can be written as: 



y - 0P3OO + e 



(15) 



and, with shift-invari ance and an amplitude, it gives a more flex- 
ible temporal model (iJaskowski and Verlegerl 119991) : 



y(t) = x ^P3ooO - t) + e(f) 



(16) 
1) 



Note that Eq. © is retrieved, but reduced to one kernel (L 
and in a multivariate case. 

One classical way for working on P300 is the dictio- 
nary design, which uses a template, i.e. a pre-determined 
pattern designed to fit a P300 waveform. An a pri- 
ori is thus injected through this prototype, generally with 
the shift-invariant model. For example , mon ochannel pat- 



terns as G aussian functions dLange 



19971) . time-limited 

sinosoids ( Jaskowski and Verlegeg 1 1999b . generic mass po- 
tentia ls dMelkonian et all 120031). Gamma functions dLi et al 



ge et all 
:rl 1 19991) 



2009) and Gabor functions dJorn et all 1201 II) have been used 
to match the P300 and other evo ked potentials. Multicha nnel 
patterns have been introduced in (Gratkowsk i et all 2008) us- 
ing Gabor temporal atoms and multichannel coefficients based 
on Bessel functions which try to model the spatial dependencies 
between channels. 

Another way is related to dictionary learning, which learns 
EEG patterns in a data-driven way. Different recent methods 
allow the learning of evoked potentials, but wit h monochannel 
( D'Avanzo et all 2011 : Nonclerca et al. , 2012 ) or multichan- 
nel patterns ( Wu and Gaol 201 1 ). Here, we are interested in 



learn ing multivariate patterns. Based on Eq. (Q3), dRivet et al 
2009) gives a least-squares (LS) estimation that takes into ac- 
count overlaps of consecutive target stimuli, defined as: 



^P3oo = arg min \\ Y - D(p \ 
= (D T D) 1 D T Y . 



(17) 



This estimation is optimal if there is no variability in latencies 
and amplitudes. Furthermore, without overlap, this estimation 
is equivalent to the grand average (GA) carried out on epoched 

signals {y P } ■■ 



= i D r Y= if 

P P Z-L 



y P 



(18) 



However, these two multivariate estimations make strong as- 
sumptions on latencies and amplitudes, which is a lack of flexi- 
bility. Based on model dTBT l. we propose to use dictionary learn- 
ing, which can be viewed as an iterative online least-squares 
estimation: 



min ^ ^ min , 



tfr(t - Tp) s.t. 



1, (19) 



with the variable r restricted to an interval around 300 ms. The 
estimated kernel is denoted by ^p3oo- 

Note that spatial filtering is not considered, which provides 
enhanced but proje cted signals (for example, the second part of 
dRivet et all 12009V ). 



5.3. Learning and qualitative comparisons 

M-DLA is applied to the training set , with K = 1. 

The grand average estimation 0p3oo is used for initialization, to 
provide a warm start, and the M-DLA is used on 20 iterations 
on the training set. The kernel length is T — 65 samples, which 
represents 270 ms, and it is constant during the learning. The 
optimal parameter r is searched only on an interval of 9 points 
centered around 300 ms after the target stimulus , which gives 
a latency tolerance of + 16.7 ms. 

To be compared to the kernel 1/^300, estimations 0p3oo and 
0P3OO are firstly limited to 65 samples and then normalized. 
Note that considered patterns have C — 64 channels and they 
are not spatially filtered to be enhanced. Multivariate tempo- 
ral patterns are plotted in Fig. [9] with 0p3oo estimated by grand 
average (a), 0p3oo estimated by least-squares (b) and 1^300 by 
multivariate dictionary learning (c). The amplitude is given in 
ordinate and the samples in abscissa. Patterns (a) and (b) are 



3 However, an edge effect is observed during the learning: temporal shifts 
T p of plentiful signals are localized on the interval edges. It means that the 
global maximum of the correlations has not be found in this interval, and t„ 
is a value by default. This can be due to the high level of noise that prevents 
the correlation from detecting the P300 position. Such signals will damage the 
kernel tAp3oo if they are used for the dictionary update, since the shift parameters 
are not optimal. To avoid this, the kernel update is not carried out for such 
signals. 
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Figure 10: Spatial patterns of the P300 computed by the grand average (a), by least-squares (b) and by multivariate dictionary learning (c). 



very similar, whereas kernel (c) is thinner and the components 
are in-phase. 

Associated spatial patterns are composed of the amplitudes 
of the temporal maximum of the patterns. They are then plot- 
ted in Fig. [TO] where (a) is the pattern estimated by the grand 
average, (b) by least-squares, and (c) by multivariate dictionary 
learning. We observe that, similarly to the temporal compar- 
ison, patterns (a) and (b) are quite similar. The topographic 
scalp (c) is smoother than the others and does not exhibit a sup- 
plementary component behind the head. But, as the true P300 
reference pattern is unknown, it is difficult to have a quantitative 
comparison between these patterns. 

5.4. Quantitative comparisons by analogy 

A solution consists in validating the previous case by anal- 
ogy with a simulation case. A P300 pattern previously learned 
with C — 64 channels is chosen to be the reference P300 of 
the simulation. P = 1000 signals are created using this refer- 
ence P300 with shift parameters drawn from a Gaussian distri- 



bution. Spatially cor related noise, reproduc ed with a FIR filter 
learned on EEG data ([Anderson et al.Ul998l) . is added to signals 



to give a signal-to-noise ratio of -10 dB. First, the GA estima- 
tion is computed for this dataset. Secondly, this estimation is 
used for the initialization of M-DLA, which is then applied to 
the dataset. This experimentation is carried out 50 times for 
different standard deviations of the shift parameters (given in 
samples number). The patterns estimated from these two ap- 
proaches are compared quantitatively, computing their maxi- 
mum correlations with the reference pattern. Since the patterns 
are normalized, the correlations absolute values are between 
and 1. 

Averaged correlations are plotted in Fig. [TT]as a function of 
the standard deviation of the shift parameters. If the recovery 
performances of GA and M-DLA are similar for small standard 
deviations, M-DLA is better than GA when the P300 patterns 
are widely shifted. This shows quantitatively that the shift- 
invariant dictionary learning is better than the grand average 
approach (equivalent to the LS estimation in this case, since 
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Figure 12: Temporal patterns of the P300: the reference (a), the grand average (b) and the multivariate dictionary learning (c). 



1 

0.98 
0.96 
0.94 
I 0.92 

m 

fi 0.9 



0.88 
0.88 
0.84 
0.82 



■ GA 

■ M-DLA 



I. 



i 



1 







5 10 
Standard deviation of shift parameters 



15 



Figure 1 1 : Correlations averaged over 50 experiments as a function of the stan- 
dard deviation of shift parameters, for the grand average (GA) and the multi- 
variate dictionary learning (M-DLA). 



there is no overlap between signals). 

Temporal patterns from one experiment are plotted in Fig. Q~2] 
with a standard deviation of cr — 6. Note that the reference 
P300 pattern (a) comes from learning of Section 15.31 but with 
an interval of 1. It is thus estimated with signals giving their 
maximum correlations at 300 ms exactly. First, we observe that 
averaging shifted patterns gives a spread estimated pattern (b) 
compared to the reference one (a). Indeed, the pattern (b) is 
the result of the convolution between the reference (a) and the 
Gaussian distribution of the shift parameters. Then, the M-DLA 
pattern (c) is thinner than (b). This confirms the results obtained 
for the correlations. By analogy, we can assume that the refer- 



ence P300 is a thin pattern, spread by the average of the shifted 
occurrences. The M-DLA allows a thinner pattern to be ex- 
tracted due to its shift-invariance flexibility. 

As shift-invariance is not easily integrated into EEG process- 
ing, the hypothesis of temporal st ationarity is of t en c arried out 
through the co variance matrix dBlankertz et al. , 12008b or the 



grand average (H amner et al.l 1201 lb . This experiment shows 



that this hypothesis is rough and provides a loss of temporal 
information. Although the shift-invariance model and the spa- 
tial flexibility of our approach is an obvious improvement, the 
goal is not to say that the learned P300 kernel is better than the 
otherfl but to present a new estimation method of EEG pat- 
terns, with the prospect to move forward with the knowledge 
of the P300, and to improve the processings based on the P300 
estimation. 

This experiment shows that learned kernels are interpretable. 
However, due to the high noise, in order to be interpreted with 
a physiological meaning, the dictionary learning algorithm has 
to time-localize activities of interest on intervals as presented 
in Experiment 2, contrary to Experiment 1. Note that the M- 
DLA can be applied to other kinds of evoked potentials, such 
as mismatch negativity and the N200, among others. 

5.5. Influence of the channels reference 

In the previous experiments, M-DLA seems to rather prefer 
patterns with in-phase components, contrary to GA and LS es- 
timations. Moreover, as observed in dMatvsiak et al. , 2005 ). the 
choice of the channels reference influences the components po- 
larities. To measure what are the real influences on the results, 
experiment of Section 1331 is reproduced with a different chan- 
nels reference. 



4 Moreover, between patterns plotted in Fig.|9}c) and !12l a), both estimated 
by M-DLA, we are not able to say which is the best. 
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Figure 14: Spatial patterns of the P300 computed by the grand average (a), by least-squares (b) and by multivariate dictionary learning (c) for average reference 
electrode. 



Data used in Section 1531 are linearly transformed to provide 
an average reference electrode configuration. P300 patterns are 
estimated by GA and LS, as well as by M-DLA which is applied 
with the same parameters. Multivariate temporal patterns of 
the P300 are plotted in Fig. [13] for the grand average (a), for 
least-squares (b) and for multivariate dictionary learning (c). 
We observe that M-DLA is able to learn a multivariate pattern 
with opposite phases. 

In the same way, spatial patterns of the P300 are plotted in 
Fig. [14] for the grand average (a), for least-squares (b) and for 
multivariate dictionary learning (c). We observe that the tan- 
gential source, causing opposite polarities behind the head and 
which was not extracted in Fig. fTOl c). is learned in Fig. [T4T c). 

6. Discussion and Conclusions 

After reviewing the classical time-frequency approaches for 
representing EEG signals, our dictionary-based method has 



been described. It is characterized by a spatial model, referred 
to as multivariate, that is very flexible, with one specific pro- 
file in each component, and with a shift-invariance used for 
the temporal model that is obviously pertinent for EEG data. 
This provides multivariate and shift-invariant temporal dictio- 
nary learning. Our approach has been shown to outperform 
the Gabor dictionaries in terms of their sparse representative 
power; i.e. the number of atoms necessary to represent a fixed 
percentage of the EEG signals. Specifically high-energy and re- 
peated patterns have been learned and the resulting dictionary 
has been shown to be robust to intra-user and inter-user vari- 
ability. Interestingly, the proposed approach has also been able 
to extract custom patterns in a very low signal-to-noise ratio 
context. This property is here demonstrated in the particular 
context of the P300 signals, which are repeated and approxi- 
mately time-localized. In the context of the EEG, the results 
obtained can be interpreted according to two distinct points of 
view. 
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First, EEG signal interpretation entails the analysis of huge 
amounts of multicomponent signals in the temporal domain. 
Consequently the best representation domain for neurophysi- 
ologists would be the ability to efficiently concentrate the infor- 
mation using a small number of active and informative compo- 
nents. In this sense, our sparse approach is shown to outperform 
classical approaches based on Gabor atoms. In other words, at a 
low and fixed number of active atoms, our method is able to bet- 
ter render the information available in initial EEG signals. This 
is also interesting for simulating multivariate EEG data. The is- 
sue of generating realistic multivariate EEG signals has indeed 
become recurrent over the past few years, to provide experi- 
mentally validated algorithms in a tightly controlled context. 
As shown in our first experiment, our approach can efficiently 
represent the diversity of EEG signals. Consequently, we be- 
lieve that it represents a relevant and competitive candidate for 
realistic EEG generation. 

Secondly, it should be kept in mind that strong a priori con- 
ditions are considered by methodologists when they are consid- 
ering pre-defined models based on generic dictionaries. While 
these assumptions can be accurate enough in the case of oscil- 
latory activities (e.g., Fourier, wavelets or Gabor), various EEG 
patterns cannot be efficiently represented through these dictio- 
naries. The flexibility of our approach relies on the fact that 
shiftable kernels are learned directly from data. This point is of 
particular interest for evoked potentials, or event-related poten- 
tials. To conclude, multivariate learned kernels are informative 
and interpretable, which is excellent for EEG analysis. 

For the prospects relating to a brain-computer interface 
(BCI), on the one hand, classical BCI methods can be im- 
proved taking into acc ount the shift flexibility. Then, on the 
other hand, as noted in (ITosic and FrossardluOl ll) . dictionaries 
learned with discriminative constraints are efficient for classi- 
fication. The future prospect will thus be to modify the pro- 
posed multivariate temporal method to give a spatio-temporal 
approac h for BCI. Finally, wavelet param eters learning meth- 
ods as (lYger and Rakotomamonivl EOlll) can be extended to 
the multicomponent case. 
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